Feature Importance - Zachary Barnes
%run featimp.py
%matplotlib inline
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Feature importance is a set of techniques for evaluating how valuable/effective features in your data are. We loosely define two objectives for feature importance:
- Choosing features to increase model performance
- Understanding which features are influential relative to the others
This report is organized into two sections:
- Implementing feature importance strategies
- Exploring more advanced feature importance libraries
At the end, we explore the feature importance packages SHAP, LIME, and PDPBOX (for partial dependence plots).
To have a way to compare techniques, we will be constraining the type of problem and dataset. We will only look at regression problems, and will be using two housing data sets.
The first dataset is the Boston housing dataset, with 13 features. Each row is a Boston suburb or town. The target y is the median value of a home in that town. This is a tiny data set, with 506 observations.
X, y = boston()
X.shape
X, y = cali()
X.shape
For a larger dataset with different features, we will also be using the California housing data set, collected from the 1990 US census. This data set has 8 features, but now each row is a census block. We are predicting the median home value for each census block.
It can be valuable to look at feature importance without a model. In this section, we examine two such techniques: Spearman's rank coefficient and Principal Component Analysis (PCA).
This is a statistic that is computed in two steps:
- Rank (order) each value in a feature
- Compute the correlation coefficient between two ranked vectors
It can be useful to see the coefficient between features in our data, but for now we will only look at the coefficient compared to our target feature.
Here is the most general definition of Spearman's rank coefficient:
$$r_s = \rho_{\operatorname{rg}_X,\operatorname{rg}_Y} = \frac{\operatorname{cov}(\operatorname{rg}_X, \operatorname{rg}_Y)} {\sigma_{\operatorname{rg}_X} \sigma_{\operatorname{rg}_Y}},$$
where $\operatorname{rg}$ denotes the ranked vectors, $\operatorname{cov}$ is the covariance between those vectors, and $\sigma_{\operatorname{rg}_X}$ is the standard deviation of a ranked vector.
If all values are distinct, we can use this version as a definition:
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of each observation. But fully distinct values are rarely the case, so we'll stick with the more general formula.
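The general formula can be checked directly: the sketch below computes $r_s$ from ranked vectors (on a small made-up example with a tie) and verifies it against scipy.stats.spearmanr.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([10.0, 20.0, 20.0, 35.0, 50.0])  # note the tie at 20
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

rg_x, rg_y = rankdata(x), rankdata(y)  # ties receive averaged ranks
# population covariance divided by the product of population standard deviations
r_s = np.cov(rg_x, rg_y, bias=True)[0, 1] / (rg_x.std() * rg_y.std())

assert np.isclose(r_s, spearmanr(x, y)[0])
```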
Ranking data seems straightforward at first, but when there are duplicates you must decide how to handle the tied values. There are many techniques for this, the most common being to assign each tied value the average of the ranks that would have been assigned to all of them.
Scipy has a rankdata() function that implements this functionality for us. Here is an example:
rankdata(np.array([1,2,3,4,1]))
We see that both 1s were given a ranking of 1.5 (the average of ranks 1 and 2), while 2 was given a ranking of 3.
rankdata(np.array([1,2,1,3,4,1]))
With three 1s, each is now given a ranking of 2 (the average of ranks 1, 2, and 3).
With this functionality, it is simply a matter of using numpy's corrcoef to calculate the correlation:
X, y = boston()
np.corrcoef(rankdata(X['LSTAT']), rankdata(y))
The corrcoef function returns the correlation matrix, and for the case of one feature compared with the target, we just need a value off the diagonal:
np.corrcoef(rankdata(X['LSTAT']), rankdata(y))[0][1]
Let's turn this functionality into a function, which we can then use to generate the coefficients of all features with respect to the target:
corr_coef = lambda x,y : np.corrcoef(rankdata(x), rankdata(y))[0][1]
sorted(zip(np.array([corr_coef(row,y) for row in X.values.T]), X.columns))
We now observe that some features have a strong negative correlation, while others have a strong positive one. LSTAT, the feature with the highest absolute coefficient, stands for "lower status": the percentage of the town's population that is lower status (according to the data dictionary). It makes sense, then, that it is negatively correlated with the median housing value: the higher this proportion, the lower the housing value.
The sign of the correlation is important to be aware of, but for the purposes of ranking features it is a detail that can be overlooked. Our final function therefore takes the absolute value of each coefficient, but leaves an option to also look at negative correlations:
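The implementation lives in featimp.py, which isn't reproduced here; below is a minimal sketch of what spear_rank_coef might look like, exercised on hypothetical toy data (a monotonic transform of the target should rank highest).

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata

def spear_rank_coef(X: pd.DataFrame, y, abs_val=True):
    """Return (coefficient, column) tuples sorted ascending by coefficient."""
    rg_y = rankdata(y)
    coefs = []
    for col in X.columns:
        r = np.corrcoef(rankdata(X[col]), rg_y)[0, 1]
        coefs.append((abs(r) if abs_val else r, col))
    return sorted(coefs)

# Toy check: 'signal' is a monotonic function of y, so its Spearman
# coefficient is exactly 1 and it sorts to the top.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = pd.DataFrame({'signal': y ** 3, 'noise': rng.normal(size=200)})
spear_rank_coef(X, y)[-1][1]  # → 'signal'
```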
spear_rank_coef(X,y)
spear_rank_coef(X,y, abs_val=False)
This format, tuples of importance and column name, will be the basis for all our importances, and works with our plot_importances function:
plot_importances(spear_rank_coef(X,y), imp_type='Spearman')
For the California housing data set, we observe that median income has the highest absolute correlation with the target.
X, y = cali()
plot_importances(spear_rank_coef(X,y),imp_type='Spearman')
The steps of this algorithm are adapted from an outside reference.
PCA is a powerful dimensionality reduction technique, but it is also useful for understanding data without a model. Loosely, PCA tells us the amount of variance each direction in feature space 'explains'. The procedure is as follows:
- Normalize the data to have a mean of 0 and standard deviation of 1.
- Compute the covariance matrix sigma: $$\Sigma=\frac{1}{m}\sum_{i=1}^mx^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$$
- Compute the eigenvectors and eigenvalues of the covariance matrix.
Each eigenvector is called a Principal Component, a direction in feature space. Each eigenvalue is the amount of variance along its corresponding eigenvector.
We will use scikit-learn's implementation of PCA, which decides the number of principal components to use as n_components == min(n_samples, n_features) - 1
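The pca_imp helper is defined in featimp.py; one plausible sketch of how it might compute per-feature importances is to weight each feature's absolute component loadings by the variance each component explains, then sum across components (the toy data below is hypothetical: 'a' and 'b' are strongly correlated, so the component they share explains the most variance).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_imp(X: pd.DataFrame):
    """Per-feature importance: |loadings| weighted by explained variance ratio."""
    Z = StandardScaler().fit_transform(X)  # mean 0, std 1
    pca = PCA().fit(Z)
    # components_ has shape (n_components, n_features)
    imp = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    return sorted(zip(imp, X.columns))

rng = np.random.default_rng(1)
base = rng.normal(size=300)
X = pd.DataFrame({
    'a': base,
    'b': base + 0.1 * rng.normal(size=300),  # nearly a copy of 'a'
    'c': rng.normal(size=300),               # independent noise
})
pca_imp(X)  # 'c' sorts lowest: its lone component explains the least variance
```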
X, y = boston()
plot_importances(pca_imp(X), imp_type='PCA')
X, y = cali()
plot_importances(pca_imp(X), imp_type='PCA')
We see that unlike Spearman's rank coefficient, PCA does not show LSTAT as significant for the Boston data set. For the California housing data, it agrees with Spearman's rank coefficient in showing median income (MedInc) as significant.
Now we look at two general techniques that work for any model: drop column and permutation importance.
Here are the steps to implement drop column importance:
- Split your data into training and test data
- Fit your model on the training data
- Calculate the baseline metric on the TEST set
- For each column, drop it from the training data, and train a new model on this dropped subset.
- Predict on the test data with the fitted model, and calculate the new metric
- The difference between the baseline and new metric is the feature importance.
It is important to note that here, we are using the test set to generate our new metric to compare to the baseline. There are some compelling arguments for also using the training set to calculate this baseline: https://christophm.github.io/interpretable-ml-book/feature-importance.html (section 5.5.2)
However, for our purposes, we want to see how our model will generalize to unseen data, and for this reason, we compute our new metric using the test set.
To split our data into train and test data, we'll use sklearn's train_test_split:
X, y = boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Fit the model and calculate the baseline metric (here we use linear regression and the R2 score):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
m = LinearRegression()
m.fit(X_train, y_train)
baseline = r2_score(y_test, m.predict(X_test))
baseline
Drop each column in turn and predict on the test set, then calculate the difference between the baseline and the new metric:
imp = []
for col in X_train.columns:
    X_drop = X_train.drop(col, axis=1)
    X_drop_test = X_test.drop(col, axis=1)
    drop_model = LinearRegression()
    drop_model.fit(X_drop, y_train)
    drop_metric = r2_score(y_test, drop_model.predict(X_drop_test))
    imp.append(baseline - drop_metric)
sorted(zip(imp, X.columns))
Now we have each feature in order of drop column importance:
plot_importances(sorted(zip(imp, X.columns)), imp_type='Drop')
From the plot, we see that LSTAT is significant, agreeing with what we saw with Spearman's rank correlation.
Something else to note with drop column: it is possible to see negative feature importances, meaning the model performs better when those features are dropped.
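The drop_column helper invoked below lives in featimp.py; a minimal sketch of how it could be implemented, wrapping the loop above into a reusable function (the toy data and names here are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def drop_column(X, y, model_cls, metric):
    """Drop-column importance: retrain without each column in turn."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    baseline = metric(y_te, model_cls().fit(X_tr, y_tr).predict(X_te))
    imp = []
    for col in X.columns:
        m = model_cls().fit(X_tr.drop(col, axis=1), y_tr)
        imp.append(baseline - metric(y_te, m.predict(X_te.drop(col, axis=1))))
    return sorted(zip(imp, X.columns))

# Hypothetical toy data: only f0 carries signal, so dropping it
# craters the R2 score and it receives the largest importance.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=['f0', 'f1', 'f2'])
y = 3 * X['f0'] + rng.normal(scale=0.1, size=200)
drop_column(X, y, LinearRegression, r2_score)[-1][1]  # → 'f0'
```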
Let's compare with two different models, Random Forest and Gradient Boosting:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
plot_importances(drop_column(X,y, RandomForestRegressor, r2_score), imp_type='Drop')
plot_importances(drop_column(X, y, GradientBoostingRegressor, r2_score),imp_type='Drop')
Because of the randomness in each of these tree-based models, the importances themselves vary on each run. For the most part, however, the relative rankings remain the same: LSTAT consistently comes out on top.
X, y = cali()
plot_importances(drop_column(X, y, LinearRegression, r2_score), imp_type='Drop')
plot_importances(drop_column(X,y, RandomForestRegressor, r2_score), imp_type='Drop')
plot_importances(drop_column(X, y, GradientBoostingRegressor, r2_score), imp_type='Drop')
With the California housing data set, it is interesting to note that while LinearRegression and GradientBoostingRegressor emphasize MedInc, RandomForest consistently has MedInc as the least important feature. This is an example of how using different models with the same feature importance technique can produce different results.
Permutation importance is very similar to drop column. The key difference is that instead of dropping a column and retraining, the model is fit once, and we predict on versions of the test data in which one column at a time is randomly permuted row-wise. This can produce some strange rows (e.g., pregnant males), but has the advantage over drop column that we only have to train the model once.
X, y = boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
m = LinearRegression()
m.fit(X_train, y_train)
baseline = r2_score(y_test, m.predict(X_test))
baseline
Permute each column in turn and predict on the test set, then calculate the difference between the baseline and the new metric:
imp = []
for col in X_train.columns:
    save = X_test[col].copy()
    X_test.loc[:, col] = np.random.permutation(X_test[col])
    perm_metric = r2_score(y_test, m.predict(X_test))
    X_test.loc[:, col] = save
    imp.append(baseline - perm_metric)
sorted(zip(imp, X.columns))
As with drop column, we observe that LSTAT is the most important feature as defined by permutation importance.
plot_importances(sorted(zip(imp, X.columns)), imp_type='Permutation')
Comparing with Random Forest and Gradient Boosting:
plot_importances(perm_imp(X, y, RandomForestRegressor, r2_score), imp_type='Permutation')
plot_importances(perm_imp(X, y, GradientBoostingRegressor, r2_score), imp_type='Permutation')
Let's also look at the california housing data set:
X, y = cali()
plot_importances(perm_imp(X, y, LinearRegression, r2_score), imp_type='Permutation')
plot_importances(perm_imp(X, y, RandomForestRegressor, r2_score), imp_type='Permutation')
plot_importances(perm_imp(X, y, GradientBoostingRegressor, r2_score), imp_type='Permutation')
It is interesting to note that with permutation importance, all models choose MedInc, latitude, and longitude as the most important, whereas with drop column, random forest identified MedInc as the least significant.
To address the inherent randomness involved in the train/test split, the permutation itself, and certain models (Random Forest, Gradient Boosting), it is helpful to have error bars on the feature importance values. The error bars represent 2 standard deviations from the mean of the feature importances over a given number of simulations.
To run the simulations, we bootstrap our data. This is simply sampling our data with replacement, drawing as many samples as there are observations in the original dataset. Another possible technique is random subsampling.
To ensure that we can compare the variance of each feature, we min max scale the importances of each feature so that they are between 0 and 1.
Here we create a dictionary of features and their bootstrapped importances, and save all the importances for min-max scaling:
X, y = boston()
d = defaultdict(list)
all_imps = []
# bootstrap to simulate different data sets
for i in range(30):
    indices = np.random.choice(X.values.shape[0], size=X.values.shape[0], replace=True)
    tup = perm_imp(X.iloc[indices], y[indices], RandomForestRegressor, r2_score)
    for k, v in tup:
        d[v].append(k)
        all_imps.append(k)
Min-Max scale for comparison:
all_imps_min = np.array(all_imps).min()
all_imps_max = np.array(all_imps).max()
for k, v in d.items():
    imp = np.array(v)
    min_max = (imp - all_imps_min) / (all_imps_max - all_imps_min)
    d[k] = [min_max.mean(), min_max.std()]
Then, we sort our dictionary and plot the mean as the bar height and the standard deviation as the error bars:
d = sorted(d.items(), key=lambda kv: kv[1][0])
x = [tup[0] for tup in d]
mean = [tup[1][0] for tup in d]
std = [tup[1][1] for tup in d]
plt.barh(x, mean, xerr=std, capsize=3, align='center', alpha=0.8)
plt.xlabel('Importance')
Here we see that the error bars on LSTAT and RM are quite wide compared to the other features, but the minimum of their error bars does not overlap with the other features' means or error bars. This gives us more insight into how the model and feature importance will behave over many samples.
Now let's look at the same model but with drop column:
plot_imp_err(X, y, RandomForestRegressor, r2_score, imp_type=drop_column)
With drop column, the mean importances overlap much more than permutation, and the error bars almost all overlap.
plot_imp_err(X, y, GradientBoostingRegressor, r2_score, imp_type=perm_imp)
plot_imp_err(X, y, GradientBoostingRegressor, r2_score, imp_type=drop_column)
Let's look at the california dataset:
X, y = cali()
Drop column takes a significant amount of time, so we will only look at permutation importance:
plot_imp_err(X, y, LinearRegression, r2_score, imp_type=perm_imp)
plot_imp_err(X, y, RandomForestRegressor, r2_score, imp_type=perm_imp)
plot_imp_err(X, y, GradientBoostingRegressor, r2_score, imp_type=perm_imp)
With the larger dataset, the importances are more stable over 30 simulations, and more consistent across different models.
How can we trust that features are not given high importances by chance?
One way to evaluate this is by computing empirical p-values. We first compute the real feature importance. Then, for a given number of iterations, we scramble the y values and compute the feature importance on the scrambled y. We count how often a feature's importance on scrambled data meets or exceeds its real importance, and plot this fraction. As a general cutoff, if a feature beats its true importance more than 5% of the time, that is cause for suspicion.
Let's first calculate the real feature importance:
X, y = boston()
d = defaultdict(list)
real_imp = perm_imp(X, y, RandomForestRegressor, r2_score)
real_imp = {v:k for k,v in real_imp} #for sorting purposes we'll want to switch the keys and values
Then compute importances on the scrambled y. 80 iterations seems to be decent:
for i in range(80):
    y = np.random.permutation(y)
    tup = perm_imp(X, y, RandomForestRegressor, r2_score)
    for k, v in tup:
        d[v].append(k)
Then calculate how many times a feature's importance was at or above its real importance. Numpy's where function comes in handy for this:
for k, v in real_imp.items():
    d[k] = np.where(np.array(d[k]) >= v, 1, 0).sum() / 80
Sort and plot our values:
d = sorted(d.items(), key=lambda kv: kv[1])
x = [tup[0] for tup in d]
y = [tup[1] for tup in d]
plt.barh(x,y)
plt.axvline(x=0.05, linestyle='--')
plt.xlabel('Empirical P-value')
According to this, only LSTAT and RM are giving importances that are not suspect.
Let's look at Gradient Boosting:
X, y = boston()
emp_pval(X, y, GradientBoostingRegressor, r2_score, imp_type=perm_imp)
It looks similar.
X, y = cali()
emp_pval(X, y, LinearRegression, r2_score, imp_type=perm_imp)
emp_pval(X, y, RandomForestRegressor, r2_score, imp_type=perm_imp)
emp_pval(X, y, GradientBoostingRegressor, r2_score, imp_type=perm_imp)
The three models seem to agree that population will have a misleading feature importance, while average number of bedrooms and average number of rooms are possible candidates.
Returning to one of our objectives for feature importance, choosing features that make a model perform best, we develop an algorithm that automatically selects which features are worth keeping based on feature importance.
The algorithm follows these steps:
- Compute a baseline metric and feature importance
- Drop the feature with the lowest feature importance
- If the metric is better than the baseline metric, then recompute the importances without the dropped feature
- Repeat steps 2 and 3 until the metric cannot be improved
X, y = boston()
To retrieve the baseline for each importance type, I added a return_baseline parameter that returns this baseline.
imps, baseline = perm_imp(X, y, RandomForestRegressor, r2_score, return_baseline=True)
curr_baseline = baseline
while curr_baseline >= baseline:
    new_features = [tup[1] for tup in imps[1:]]
    new_imps, curr_baseline = perm_imp(X[new_features], y, RandomForestRegressor, r2_score, return_baseline=True)
    if curr_baseline >= baseline:
        baseline = curr_baseline
        imps = new_imps
print(baseline)
plot_importances(imps)
This procedure works, but because of the inherent randomness in the train/test split, the permutation importance, and the models themselves, the algorithm is more useful if the process is repeated, saving the best score and feature subset.
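The auto_selection helper lives in featimp.py and isn't reproduced here; below is a hedged sketch of how it might repeat the greedy loop, with permutation importance computed inline so the example stays self-contained (the helper name perm_imp_inline and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def perm_imp_inline(model, X_te, y_te, baseline, metric):
    """Permutation importance of an already-fitted model on a test set."""
    imp = []
    for col in X_te.columns:
        save = X_te[col].copy()
        X_te.loc[:, col] = np.random.permutation(X_te[col])
        imp.append((baseline - metric(y_te, model.predict(X_te)), col))
        X_te.loc[:, col] = save
    return sorted(imp)

def auto_selection(X, y, model_cls, metric, iterations=5):
    """Greedily drop the least important feature; repeat the whole
    process `iterations` times and keep the best (score, features)."""
    best_score, best_features = -np.inf, list(X.columns)
    for _ in range(iterations):  # repeat to smooth out randomness
        features = list(X.columns)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        while len(features) > 1:
            m = model_cls().fit(X_tr[features], y_tr)
            score = metric(y_te, m.predict(X_te[features]))
            if score > best_score:
                best_score, best_features = score, list(features)
            imps = perm_imp_inline(m, X_te[features].copy(), y_te, score, metric)
            features.remove(imps[0][1])  # drop the least important feature
    return best_score, best_features

# Hypothetical toy data: only f0 and f1 carry signal
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y = 2 * X['f0'] - X['f1'] + rng.normal(scale=0.1, size=300)
score, feats = auto_selection(X, y, LinearRegression, r2_score)
```

The best-scoring subset always retains f0 here, since removing it destroys most of the explainable variance.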
plot_importances(auto_selection(X,y, RandomForestRegressor, r2_score, imp_type=perm_imp))
plot_importances(auto_selection(X,y, GradientBoostingRegressor, r2_score, imp_type=perm_imp))
Using the above procedure, we generally see a higher validation score on the Boston data set as more unnecessary features are removed.
X, y = cali()
plot_importances(auto_selection(X,y, LinearRegression, r2_score, imp_type=perm_imp))
plot_importances(auto_selection(X,y, RandomForestRegressor,
r2_score, imp_type=perm_imp, iterations=10))
plot_importances(auto_selection(X,y, GradientBoostingRegressor,
r2_score, imp_type=perm_imp, iterations=10))
X.columns
In these cases, we see that Random Forest and Gradient Boosting eliminated the same two features: Population and AveBedrms.
Now we will explore some other techniques and libraries for feature importance.
Shap is a feature importance library that claims to work for any arbitrary black-box model. Its output is "Shap values": feature importance values generated using a game-theoretic approach. The intuition for how these values are generated is thus somewhat of a black box itself! Nonetheless, there is value in comparing Shap with the feature importance techniques we implemented above.
Most of this code is from their front page on git: https://github.com/slundberg/shap
import xgboost
import shap
shap.initjs()
X,y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
(These figures don't render via pdf, but they appear in the .ipynb)
Shap is optimized for gradient boosting models. Here XGBoost creates predictions, from which Shap calculates importances. The image tells us that LSTAT has a Shap value of 4.98, and the red indicates that the feature pushes the predicted target value higher.
RM on the other hand is in blue, indicating pushing the value of the target prediction lower.
shap.summary_plot(shap_values, X)
This graph shows how the Shap values for each feature relate to the feature's values. LSTAT has high (red) feature values associated with lower Shap values. Unimportant features, such as CHAS at the bottom, do not have a wide spread of Shap values.
shap.summary_plot(shap_values, X, plot_type="bar")
This graph shows the mean absolute Shap value for each feature, giving us perhaps the easiest way to visualize importance from the Shap library. It generally agrees with what we observed with permutation and drop column, but doesn't tell us whether the model would be better if we dropped some of the columns (it misses the idea of negative feature importance).
import xgboost
import shap
shap.initjs()
X,y = cali()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
(These figures don't render via pdf, but they appear in the .ipynb)
Shap also agrees with our previous techniques here, showing that MedInc pushes housing values higher.
shap.summary_plot(shap_values, X)
Latitude and Longitude are correctly given higher impact together, since they are correlated.
shap.summary_plot(shap_values, X, plot_type="bar")
Most of our previous techniques did not place AveOccup above latitude and longitude, so there is some difference when looking at the average Shap value plot.
Lime is a feature importance library that works with any model and uses linear models to understand the model's output. According to their GitHub: "Intuitively, an explanation is a local linear approximation of the model's behaviour."
The 'L' in Lime stands for Local, and this is what differentiates it from other feature importance techniques. It looks at predictions one observation at a time, fitting a linear model to the local prediction space. Notably, this is not a global feature importance; it is only useful when looking at one prediction at a time.
Code for the tabular model is from their tutorial:
import lime
import lime.lime_tabular
boston = load_boston()
rf = RandomForestRegressor(n_estimators=1000)
train, test, labels_train, labels_test = train_test_split(boston.data, boston.target, train_size=0.80)
rf.fit(train, labels_train)
categorical_features = np.argwhere(np.array([len(set(boston.data[:,x])) for x in range(boston.data.shape[1])]) <= 10).flatten()
explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=boston.feature_names,
class_names=['price'], categorical_features=categorical_features, verbose=True, mode='regression')
i = 25
exp = explainer.explain_instance(test[i], rf.predict, num_features=5)
exp.show_in_notebook(show_table=True)